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NONLINEAR  PROGRAMMING  FOR  IARGE,  SPARSE  SYSTEMS 


B.  A.  Murtagh  and  M.  A.  Saunders 


ABSTRACT 


An  algorithm  for  solving  large-scale  nonlinear  programs  with 
linear  constraints  is  presented.  The  method  combines  efficient  sparse- 
matrix  techniques  as  in  the  revised  simplex  method  with  stable  variable- 
metric  methods  for  handling  the  nonlinearities.  A general-purpose 
production  code  (MINOS)  is  described,  along  with  computational  experience 
on  a wide  variety  of  problems. 


NONLINEAR  PROGRAMMING  FOR  LARGE,  SPARSE  SYSTEMS* 
B.  A.  Murtagh  and  M.  A.  Saunders 


1.  Introduction 

This  paper  describes  our  efforts  to  develop  a nonlinear  programming 
algorithm  for  problems  characterized  by  a large  sparse  set  of  linear  con- 
straints and  a significant  degree  of  nonlinearity  in  the  objective  function. 
It  has  been  our  experience  that  many  linear  programming  problems  are 
inordinately  large  because  they  are  attempting  to  approximate,  by  piecewise 
linearization,  what  is  essentially  a nonlinear  problem.  It  also  appears 
that  many  real-life  problems  are  such  that  only  a small  percentage  of  the 
variables  are  involved  nonlinear ly  in  the  objective  function.  Thus  we  are 
led  to  consider  problems  which  have  the  following  canonical  form: 


minimize 


F(x)  = f(xW)  + 


T 

c x 


subject  to  Ax  = b 

f,  < X < u 


(1) 

(2) 

(5) 


where  A is  m x n,  m < n.  We  partition 

N 

and  a nonlinear  portion  x : 


x = 


x 


into  a linear  portion 


L 


x 


ft 

An  earlier  version  of  this  paper  was  presented  at  the  8th  Int.  Symp.  Math. 
Prog.,  Stanford,  California,  August  1975. 
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will  normally  be  called  the  nonlinear  variables. 


The  components  of  x' 

Note  that  A and  c operate  on  all  variables  x.  In  some  cases  the 

II 

cart  of  cTx  involving  x may  be  incorporated  into  f (x  ) ; in  other 
cases  c may  be  zero.  We  assume  that  the  function  f(x  ) is  continuously 
differentiable  in  the  feasible  region,  with  gradient 

Vf(xN)  = g(xN). 


The  research  work  reported  here  was  stimulated  by  some  of  the 
deficiencies  in  the  algorithm  of  Murtagh  ana  Sargent  [44],  [50],  especially 
when  applied  to  large-scale  systems.  The  resulting  algorithm  is  related 
to  the  reduced-gradient  method  of  Wolfe  [56]  and  the  variable-reduction 
method  of  McCormick  [4lJ,  [42].  It  also  draws  much  from  the  unconstrained 
and  linearly-constrained  optimization  methods  of  Gill  and  Murray  [21],  [22], 
[25]. 

In  essence  the  algorithm  is  an  extension  of  the  revised  simplex 
method  (Dantzig  [12]).  To  use  some  of  the  associated  terminology,  it 
might  be  described  as  an  extension  which  permits  more  than  m variables 
to  be  basic.  Because  of  the  close  ties  with  linear  programming  (LP)  we 
have  been  able  to  incorporate  into  our  implementation  many  of  the  recent 
advances  in  LP  technology.  The  result  is  a computer  program  which  has 
many  of  the  capabilities  of  an  efficient  LP  code  and  is  also  able  to 
deal  with  nonlinear  terms  with  the  power  of  a quasi-Newton  procedure. 
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1.1  Notation  and  Terminology 


Partitioning  x and  F(x)  into  linear  and  nonlinear  terms 
is  of  considerable  practical  importance;  for  descriptive  purposes 
h -Tever , it  is  convenient  to  denote  F(x)  and  VF(x)  = g/x)  + c 
simply  by  f(x)  and  g(x). 

With  a few  conventional  exceptions , we  use  upper-case  letters 
for  matrices,  lower-case  for  vectors  and  Greek  lower-case  for  scalars. 
The  quantity  e > 0 represents  the  precision  of  floating-point 
arithmetic. 

The  terms  "variable-metric"  and  "quasi -Newton"  will  be 
used  synonymously,  as  will  the  adjectives  "reduced"  and  "projected." 
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2. 


Basis  of  the  Method 


2.X.  Variable-metric  projection 

Before  turning  to  the  canonical  form,  consider  the  problem 


minimise 

f(x)  = f(xL,  ...  , xn)  . 

(4) 

subject  to 

Ax  < b 

fSJ  m wm 

(5) 

We  will  assume  that  f(x)  can  be  expanded  in  a Taylor^s  series 
with  remainder  of  second  order: 


where 

* ^+1  - Sk 

«k  =vf(xk) 

and  G(xk  + yA^)  is  the  Hessian  matrix  of  second  partial  derivatives 
evaluated  at  some  point  between  and  Note  that  G is  a 

constant  matrix  if  f(x)  is  a quadratic  function. 

Suppose  that  the  current  point,  x^,  is  on  the  boundary  of  m 
constraints  (m  < n).  Denoting  the  corresponding  m rows  of  A by 
the  submatrix  B we  have: 


(7) 


Variable -metric  projection  methods  (Goldfarb  [30],  Murtagh  and 
Sargent  [44])  are  based  on  two  properties  required  of  the  step  Ax^ 


Property  1. 

gk  + G(xk)  &xk  = b\ 


(8) 


4 


i.e.,  the  step  Axk  is  to  a stationary  point  on  the  surface  of  the  m 
active  constraints-  The  gradient  of  f(x)  at  (given  by  the  left 

hand  side  of  equation  (8)  if  is  suffic5  ;ntly  close  to  the  stationary 
point  for  a quadratic  approximation  to  be  valid)  is  orthogonal  to  the 
surface  and  thus  denoted  by  a linear  combination  of  the  constraint  normals 
(given  by  the  right  hand  side  of  equation  (8)). 

Property  2. 

“ 0 (9) 

i.e.  the  step  remains  on  the  surface  given  by  the  intersection  of  the  m 
active  constraints. 

The  implementation  put  forward  by  Murtagh  and  Sargent  [44] 
used  these  two  properties  to  produce  a step  given  by 

■ A\(«k  - ®V  (10) 


where  7^.  (an  estimate  of  the  Lagrange  multipliers  for  B)  is  obtained 
by  substituting  equation  (9)  into  (8): 

lk  = (BSkBT)-3  BSkgk  , (11) 

Vy  is  a step-size  parameter  used  to  adjust  the  length  of  the  step  Axk, 
and  S,  is  a variable -metric  approximation  to  G”1(x.  ).  A rank-1 

K — „[ 

updating  procedure  is  used  to  modify  Sk  each  iteration.  This  allows 
the  matrix  (BSkB  ) to  be  updated  by  a rank-1  correction  also.  Note 
however  that  no  advantage  is  taken  of  either  B or  G being  sparse, 
since  the  matrices  G-1  and  (BG”^)"1  are  in  general  dense. 
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The  procedure  works  well  in  many  cases  (see  [50]),  but  storage 
limitations  prevent  application  to  large  problems,  quite  regardless  of 
numerical  accuracy  in  the  updating  procedures.  The  motivation  for  the 
present  work,  therefore,  is  to  use  Property  1 and  Property  2 in  a more 
efficient  manner,  particularly  for  the  case  of  large  sparse  systems  with 
relatively  few  nonlinear  variables. 


2.2.  Extension  to  Large  Sparse  Systems 

We  return  now  to  the  canonical  form  of  equations  (l)-(5).  The 
key  to  success  with  the  simplex  method  lay  in  adopting  such  a canonical 
form  and  working  with  so-called  hasp c feasible  solutions,  which  are 
characterized  by  having  n-m  "nonbasic"  variables  equal  to  their  upper  or 
lower  bound.  With  nonlinear  problems  we  cannot  expect  an  optimal  solution 
to  be  of  this  kind.  As  a simple  generalization  we  introduce  the  notion 
of  "superbasic"  variables  and  partition  the  set  of  general  constraints  (2) 
as  follows: 


m s n-m-s 


*1 


*2 


= b 


(12) 
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The  matrix  B.^  is  square  and  nonsingular  and  corresponds  to  the  usual 
basis  matrix  of  the  simplex  method;  is  m vs  with  0<s<n-m,  and 

the  associated  variables  xg  are  called  the  superbasics  as  shown.  Both 
basics  and  superbasics  are  free  to  vary  between  their  bounds. 

The  motivation  for  this  partitioning  is  provided  by  the  following: 

Theorem  1.  If  a nonlinear  program  has  an  optimal  solution  and  if  it  involves 
t variables  nonlinear ly,  an  optimal  solution  exists  in  which  the  number  of 
superbasic  variables  s satisfies  s < t. 

Proof  (due  to  A.  Jain) . Let  the  nonlinear  variables  be  fixed  at  their 
optimal  values.  The  remaining  problem  is  a linear  program  for  which  a 
basic  solution  exists  (s  =0).  The  result  follows  trivially. 

Thus  in  many  cases  s can  be  guaranteed  to  remain  small. 

Using  the  partitioning  given  by  equation  (l 2),  Property  1 becomes*. 


' 6-1  " 

fA*i1 

r T 
B1 

0 ‘ 

6e 

+ G 

O 

& 

= 

T 

B2 

0 

■ \ ' 

-S3  - 

K'V 

XI 

<3 

i 

bt 

L 

I _ 

. ^2 

and  Property  2 becomes: 


‘B1  B2  V 

'*»l' 

Ax 

0 0 1 

£2 

Ax, 

L ~3  . 

(13) 


(lb) 


where  Ax^  and  have  been  partitioned  corresponding  to  the  partitioning 
of  A (and  the  subscript  k,  referring  to  the  iteration  number,  dropped 
for  convenience). 

From  (l4)  we  have 
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8 


(22) 


*2 


which  is  analogous  lo  the  vector  of  reduced  costs  in  linear  programming. 

The  third  result  from  equation  (13),  following  pre-multiplication 
by  the  matrix  (l8),  is  an  expression  for  the  appropriate  step: 


[-WT  I 0]  G 


-W 

l 

0 


Ax 

~2 


-h 


(85) 


where 

h = [-wT  i o]6  = g2  - = sg  - b2tt  . (2k) 


The  form  of  equation  (23)  suggests  that 
[-WT  I 0]  G 

can  be  regarded  as  a "projected"  (or  "reduced")  Hessian  ana  h » [-W  I 0]g 

r*> 

a projected  gradient,  with  (23)  giving  a Newton  step  in  the  independent 
variables  Ax^,  Note  that  lull  = 0 becomes  a necessary  condition  for  a 
stationary  point  in  the  current  set  of  active  constraints,  which,  if 
the  projected  Hessian  is  nonsingular,  implies  that  jjAx^j!  = 0. 

While  the  above  derivation  is  based  on  the  two  properties  which 
characterize  variable -metric  projection  algorithms,  the  resultant  relations 
could  be  equally  regarded  as  a reduced-gradient  method  in  which 
the  number  of  independent  variables  is  reduced  to  s,  the  dimension  of 
AX5.  We  will  not  spend  time  here  discussing  the  relationship  between 
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the  two  seemingly  different  approaches,  but  refer  the  reader  to  two  review 
papers  by  Fletcher  [17]  and  Sargent  [49]. 

Recently  Gill  and  Murray  [25]  have  considered  a class  of  algorithms 
in  which  the  search  direction  along  the  surface  of  active  constraints  is 
characterized  as  being  in  the  range  of  a matrix  Z which  is  orthogonal 

A A 

to  the  matrix  of  constraint  normals.  Thus,  if  Ax  = b is  the  current 
set  of  n-s  active  constraints,  Z is  an  n x s matrix  such  that 


AZ  = 0 . (26) 


In  the  notation  of  [25],  the  main  steps  to  be  performed  each  iteration 
are  as  follows.  (They  generate  a feasible  descent  direction  p. ) 

A.  Compute  the  projected  (reduced)  gradient  gA  = ZTg. 

B.  Form  some  approximation  to  the  projected  Hessian,  viz. 

G.  A zTGZ 
A • 


C.  Obtain  an  approximate  solution  to  the  system  of  equations 


T T 

Z GZ^  = -Z  g 


(27) 


by  solving  the  system 


ga2a  = 


D.  Compute  the  search  direction  p = Zpft. 
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E. 


Perform  a line-search  to  find  an  approximation  to  a , where 


min  f (x  + o^) 

a 


[x  + op  feasible) 

/V  «"W 


Apart  from  having  full  column  rank,  equation  (?6)  is  (algebraically) 
the  only  constraint  on  Z and  thus  Z may  take  several  forms.  The 
particular  Z corresponding  to  our  own  procedure  is  of  the  form 


■ -w  ‘ 

r -b“:V  1 

1 2 

z = 

I 

= 

1 

0 _ 

0 

(28) 


This  is  a convenient  representation  which  we  will  refer  to  for  exposition 
purposes  in  later  sections,  but  we  emphasize  that  computationally  we  work 
only  with  Bg  and  a triangular  (LU)  factorization  of  B^. 

For  many  good  reasons  Gill  and  Murray  [25]  advocate  a Z whose 
columns  are  orthonormal  (Z^Z  = i).  The  principal  advantage  is  that 
transformation  by  such  a Z does  not  introduce  unnecessary  ill-conditioning 
into  the  reduced  problem  (see  steps  A through  I)  above,  in  particular 
equation  (27)).  The  approach  has  been  implemented  in  programs  described 
by  Gill,  Murray  and  Picken  (e.g.  [27])  in  which  Z is  stored  explicitly 
as  a dense  matrix.  Extension  to  large  sparse  linear  constraints  would 
be  possible  via  an  LDV  factorization  (see  Gill,  Murray  and  Saunders  [29]) 
of  the  matrix  [B^  Bg ] : 

[B1  Bg]  = [L  0]DV 
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where  L is  triangular,  D is  diagonal  and  D r/2V  is  orthonormal,  with 
L and  V being  stored  in  product  form.  However  if  Bg  has  more  than 
1 or  2 columns,  this  factorization  will  always  be  substantial^  more  dense 
than  an  LU  factorization  of  Thus  on  the  grounds  of  efficiency  we 

proceed  with  the  Z in  (28).  At  the  same  time  we  are  conscious  (from 

_ « 4-hat  B must  be  kept  as  well-conditioned 

the  unwelcome  appearance  of  S-j.  ) tna z ^ mutou  * 

as  possible. 
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3.  Implementation 


The  basic  ideas  were  presented  in  the  previous  section;  their 
actual  implementation  in  a computer  code  requires  a good  deal  more  effort. 
The  code  itself  is  a Fortran  program  called  MINOS*  which  is  designed 
to  be  almost  machine -independent  and  to  operate  primarily  within  main 
memory.  The  central  part  of  MINOS  is  an  efficient  implementation  of  the 
revised  simplex  method  which  incorporates  several  recent  advances  in 
linear  programming  technology.  These  include: 

1.  Fast  input  of  the  constraint  data  in  standard  MPS  format**  using 
hash  tables  (in  particular,  the  method  of  Brent  [6])  for  storing 
row-names  and  distinct  matrix  coefficients. 

2.  Compact  in-core  storage  of  the  constraint  matrix  A using  an  elementary 
version  of  Kalan's  super-sparseness  techniques  [36]. 

3.  Upper  and  lower  bounds  on  all  variables. 

4.  A version  of  Hellerman  and  Rarick' s "bump  and  spike"  algorithm  P^  [35l 
for  determining  a sparse  LU  factorization  of  the  basis  matrix  B^. 

5.  Imbedding  of  non-spike  columns  of  L within  A. 

6.  Stable  updating  of  the  LU  factors  of  B^  .by  the  method  of  Bartels 
and  Golub  [ 2 ],  [ 3 ] as  implemented  by  Saunders  [52]. 

7.  An  improved  "CHUZR"  procedure#  for  phase  1 of  the  simplex  method, 
following  ideas  due  to  Rarick  [48  ] and  Conn  [10  ] . 


MINOS  (my'-noss)  = a Modular  In-core  Nonlinear  Optimization  System. 

This  is  the  CONVERT  data  format  described  in  user's  manuals  for  the 
IBM  systems  MPS/36O,  MPSX  and  MPSX/3?0. 


*■>** 

The  block-triangular  structure  of  Bi  is  currently  being  found  using 
subroutines  MC13  and  MC16  from  the  Harwell  Subroutine  Library  (Duff  [l4]. 
Duff  and  Reid  [15]).  Hellerman  and  Rarick' s P3  [32]"  is  then 
applied  to  each  block. 


# 


Implemented  by  J.  A.  Tomlin. 
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For  optimization  of  the  reduced  function  we  have  implemented 

T 

a quasi-Newton  procedure  using  the  factorization  G^  = R R (R  upper 

T 

triangular)  to  approximate  Z GZ.  This  parallels  the  methods  described 
by  Gill  and  Murray  [21],  [22],  Gill,  Murray  and  Bitfield  [28]  which  are  based 
on  the  Cholesky  factorization  = LDL  (L  lower  triangular , D diagonal). 
Stable  numerical  methods  based  on  orthogonal  transformations  are  used  for 
modifying  R during  unconstrained  steps  and  for  certain  other  modifications 
to  R whenever  the  basis  matrices  B.^  and  change.  (Operations  on 
R rather  than  L and  D are  somewhat  easier  to  implement  and  involve 
little  loss  of  efficiency  in  this  context. ) 

Another  module  which  is  fundamental  to  the  success  of  the  present 
algorithm  is  an  efficient  and  reliable  line-search.  The  particular  routine 
used  is  a Fortran  translation  of  Gill  and  Murray's  Algol  60  procedure 
delinsearch,  which  uses  successive  cubic  interpolation  with  safeguards 
as  described  in  [24],  This  routine  evaluates  the  objective  function  and 
its  gradient  simultaneously  when  required.  We  have  left  just  one  parameter 
available  to  the  user  to  change  at  his/her  discretion,  namely,  eta 
(0.0  < eta  < 1.0)  which  controls  the  accuracy  of  the  search.  This 
flexibility  has  proved  to  be  very  satisfactory  in  practice. 
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3.X.  Summary  of  procedure 


An  outline  of  the  optimization  algorithm  is  given  in  this  section; 
some  of  the  finer  points  of  implementation  are  discussed  in  later  sections. 
Assume  we  have  the  following: 

T 

(a)  A feasible  sector  x = Xg  x^]  satisfying  [B1  B^  B^]x  = b, 
l < x < u. 

/V  _•  tst  — /V 

(b)  The  corresponding  function  value  f(x)  and  gradient  vector 

Sq  * 

(c)  The  number  of  superbasic  variables,  s (0  < s < n-m). 

(d)  A factorization,  LU,  of  the  m x m basis  matrix  B^. 

m 

(e)  A factorization,  RE,  of  a variable-metric  approximation  to  the 

T 

s X s matrix  Z GZ. 

T 

(f ) A vector  j r satisfying  B^JT  = g.^. 

T 

(g)  The  reduced  gradient  vector  h = - B^ir. 

(h)  Small  positive  convergence  tolerances  TOLRG  and  TOLDJ. 

Step  1.  (Test  for  convergence  in  the  current  subspace) 

If  ||h!|  > TOLRG  go  to  step  3. 

Step  2.  ("PRICE",  i.e.  estimate  Lagrange  multipliers,  add  one  superbasic) 

(a)  Calculate  ^ = “ B~ir  . 

(b)  Select  A < -TOLDJ  (A  > + TOLDJ),  the  largest  elements  of 
\ corresponding  to  variables  at  their  lower  (upper)  bound. 

If  none,  STOP;  the  Kuhn-Tucker  necessary  conditions  for  an 
optimal  solution  are  satisfied. 
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(c)  Otherwise, 

(i)  Choose  q = q^  or  q - corresponding  to 

(ii)  add  a as  a new  column  of  B_; 

~q  2 

(iii)  add  "\  as  a new  element  of  h: 

q 

(iv)  add  a suitable  new  column  to  R. 

(d)  Increase  s by  1. 

Step  3»  (Compute  direction  of  search,  p = Zp  ) 

(a)  Solve  R^Rpg  = -h . 

(b)  Solve  LUp^  = "B2®°  ’ 

V 

*2  ' 

o _ 

Step  4.  (Ratio  test,  "CHUZR") 

(a)  Find  ol  > 0,  the  greatest  value  of  a for  which  x + ap 
is  feasible. 

(b)  If  amx  = 0 go  to  step  7. 

Step  5.  (Line-search) 

(a)  Find  a,  an  approximation  to  a > where 

f (x  + a *£,)  = min  f (x  + dp)  . 

0<  0 <a 

—raiax 

(b)  Change  x to  x + a£  and  set  f and  g to  their  values 
at  the  new  x. 


(c)  Set  p = 


16 


Step  6.  (Compute  reduced  gradient,  h - 'A1  g) 

T T 

(a)  Solve  U L 7T  = g^. 

T 

(b)  Compute  the  new  reduced  gradient,  h ~ gg  - B^jr. 

T 

(c)  Modify  R to  reflect  some  variable -metric  recursion  on  R R, 
using  a,  Pg  and  the  change  in  reduced  gradient,  h - h. 

(d)  Set  h = h. 

(e)  If  a < ex  go  to  step  1.  No  new  constraint  was  encountered 

fllclX 

so  we  remain  in  the  current  subspace. 


Step  7.  (Change  basis,  delete  one  superbasic) 

Here  a - a and  for  some  p (0  < p < m+s)  a variable 
max  - 

corresponding  to  the  p-th  column  of  [ Bg]  has  reached  one  of 
its  bounds. 

(a)  If  a basic  variable  hit  its  bound  (o  < p < m), 

(i)  interchange  the  p-th  and  q-th  columns  of 


' B, 

r i 

1 

and 

2 

T 

T 

~1 

x. 

(ii) 
(iii ) 


respectively,  where  q is  chosen  to  keep  non- 
singular (this  requires  a vector  T which  satisfies 

T T \ 

ULT  = e : 


modify  L,  U,  R and  jr  to  reflect  this  change  in  B^; 

T 

compute  the  new  reduced  gradient  h = gg  - B^tt  . 


(b ) If  a superbasic  variable  hit  its  bound  (m  ■'  p <-  m+s ) , define 
q = p-m. 


L? 


(c)  Make  the  q-th  variable  in  B0  nonbasic  at  the  appropriate 
bound,  thus: 

(i)  delete  the  q-th  columns  of 


r 1 

' R ' 

c. 

T 

and 

, T 

j 

. h . 

(ii)  restore  R to  triangular  form. 

(d)  Decrease  s by  1 and  go  to  step  1. 

3.2  Work  per  iteration 

The  work  involved  in  one  pass  through  the  above  procedure  is 
roughly  equivalent  to 

(a)  one  iteration  of  the  revised  simplex  method  on  a linear  program  of 
dimensions  m X n,  plus 

(b)  one  iteration  of  a quasi-Newton  algorithm  on  an  unconstrained 
optimization  problem  of  dimension  s. 

Note  that  the  ERICE  operation  (Step  2)  is  performed  only  when  ||h||  is 
sufficiently  small,  which  means  an  average  of  about  once  very  5 iterations. 
This  is  a typical  frequency  in  commercial  LP  systems  using  multiple  pricing. 
The  extra  work  involved  in  the  quasi -Newton  steps  is  somewhat  offset  by 
the  fact  that  a basis  change  (Step  7(a))  occurs  only  occasionally,  so  the 
growth  of  nonzeros  in  the  UJ  factors  of  B^  is  minimal.  Thus  if  s 
is  of  reasonable  size  and  if  f (x)  and  g(x)  are  inexpensive  to  compute, 
iterations  on  a large  problem  will  proceed  at  about  the  same  rate  as  if 
the  problem  were  entirely  linear. 
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3 . 3 Updating  the  Matrix  Factorizations 


As  in  the  simplex  method..,  o stable  factorisation  of  the  basis 

matrix  is  important  for  solving  equations  of  the  form  B^y  = b or 
T 

B-jj.  = c.  Here  we  use  an  implementation  of  the  method  of  Bartels  and 
Golub  [ 2 3,  [33  for  updating  the  factorization  = LU.  Details  are 
given  in  Saunders  [52].  We  normally  re-factorize  every  50  iterations 

regardless  of  the  number  of  modifications  that  have  been  made  to  L and 

U. 


The  remainder  of  this  section  is  devoted  to  the  methods  used  for 

T T / 

modifying  R in  the  approximation  R R ~ Z GZ  whenever  x and/or  Z 

change.  The  notation  R will  be  used  to  represent  R after  any  particular 

modification.  To  ensure  stability,  all  modifications  to  R have  been 

implemented  using  elementary  orthogonal  matrices  Q.  (plane  rotations) 

J 

whose  non-trivial  elements  are 


where 


s . 

a 


3-5.1.  Variable -metric  updates 

Any  of  the  usual  updating  formulas  (e.g.  Davidon  [ 13],  Fletcher 
and  Powell  ( 18} , Broyden  [73)  can  te  used  to  account  for  a nonzero  change 
in  the  superbasic  variables  (Step  6).  The  two  we  have  experimented  with  arej 


The  Complementary  DFP  formula 


-T-  T 1 T 1 T 

COMDFP:  R R = R R + XX  + 

CSC  Eg  Eg 


The  Rank- one  Formula: 

-T-  T IT 
RAMI:  R R - R E + ~ — ww 

0&  Eg  ~ 

where  y = h-h,  the  change  in  reduced  gradient,  and  w = y + oh. 
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The  COMDFP  formula  can  be  used  on  both  constrained  and  unconstrained 
steps  (<7  - and  a < resp. ).  An  alternative  is  to  use  RANK1 

on  constrained  steps  as  long  as  it  results  in  a positive  definite  recursion, 
otherwise  COMDFP.  Systematic  testing  may  perhaps  reveal  a slight  advantage 
for  one  strategy  over  another,  but  in  the  interest  of  simplicity  we  now 
use  COMDFP  in  either  case. 

If  a - O'  and  a is  very  small  it  is  possible  that  the 
max  max 

computed  value  of  y will  be  meaningless.  Following  the  suggestion  of 
M.  J.  D.  Powell  (private  communication)  we  allow  for  this  by  monitoring 
the  change  in  directional  derivative  and  modifying  R only  if 

> 0.9  h\  • 

T 

The  same  test  is  used  even  if  a < Since  h p^  < 0,  this  means 

that  R is  modified  if 


which  will  normally  be  true  if  a value  eta  <0.9  is  given  to  the 
parameter  of  procedure  delinsearch,  which  uses  | t|  | < eta  as  one  criterion 
for  a successful  search.  (Note  that  g,  p,  = g,  Zp,,  - h pg. ) 

Both  COMDFP  and  RANK1  are  implemented  by  means  of  the  following 
routines: 


-T-  T T 

R1ADD:  R R = R R + w 

-T-  T T 

R1SUB:  RE  = R R - w 


These  use  forward  and  backward  sweeps  of  plane  rotations  respectively, 
as  described  in  Saunders  [51,  Ch.  7],  Gill,  Golub,  Murray  and  Saunders  [20  ]. 


3*3*2  Basis  change  (step  7 la)) 


Suppose  that  the  p-th  basic  variable  is  interchanged  with  the  q-th 
superbasic  variable.  Once  R has  been  updated  to  account  for  the  move 
which  is  causing  the  basis  change  (Step  6),  a further  "static"  update  is 
required  to  allow  for  a corresponding  change  in  the  definition  of  Z. 

The  relationship  between  the  new  null- space  matrix  and  the  old  is  given  by 


Z = Z(l  + e vT) 


(29) 


where 


is  the  q-th  unit  vector  and  y 


is  defined  by  the  equations 


T 

Bi5P 


y 


T 


V = - rr  (y  + e ) 


Derivation  of  this  result  is  rather  lengthy  but  the  quantities  involved 
are  easily  computed  and  they  serve  several  purposes: 

1.  The  j-th  element  of  jr,  viz. 


yi  = yTe.  = A e = tffe.e) 

0 ~ 'd?  ~p  1 2~j 


is  the  pivot  element  that  would  arise  if  the  j-th  column  of  B2  were 
selected  for  the  basis  change.  Hence  y can  be  used  as  a guide  for 
dor* mining  q.  Broadly  speaking , the  condition  of  B^  will  be 
preserved  as  well  as  possible  if  y is  the  largest  available  pivot 
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element  (assuming  the  columns  of  Brj  have  similar  norm).  In 

d 

practice  it  is  reasonable  to  relax  this  condition  slightly  in  favor 
of  choosing  a superbasic  variable  that  is  away  from  its  bounds.  Thus 
we  define  q by  the  following: 


y = maxjy . 1 
Jmax 


d,  = min  |x  - £ f , |x.  - u.j 

d {x.  superbasic)  J J d d 
0 


d = max{d. | |y . | > 0.1  y 1 
q 0 3 - max  ’ 


This  rule  is  numerically  more  reliable  than  that  suggested  by  Abadie  [ 1 ] , 
which  in  the  above  notation  is  equivalent  to  maximizing  1 y j I cL j * 

2.  7T  can  be  used  to  update  the  vector  tt  that  is  computed  in  Step  6(a). 
(after  the  last  move  but  before  the  current  basis  change).  Thus 

ir  = tt  + (h  /y„  )r 

where  h^  is  the  appropriate  element  of  the  reduced  gradient  h in 
Step  6(b).  This  is  the  updating  formula  suggested  by  Tomlin  [5*0 
for  use  within  the  simplex  method.  Nonlinearity  is  irrelevant  here 
since  the  basis  change  is  simply  a redefinition  of  Z. 

J.  IT  can  also  be  used  to  update  the  LU  factors  of  B,  (see  Tomlin  [5*0, 
Goldfarb  ( 3l] ) . 

The  modification  to  R corresponding  to  equation  (29)  is 
accomplished  as  follows: 
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R1PR0D:  RTR  = (I  + vej;)  RrR(l  + e vT) 

If  r is  the  q-th  column  of  R,  this  expression  may  be  written 
~q 

RTR  = (RT  + vrT ) (R  + r vT) 

~q~ 

A partial  backward  sweep  of  plane  rotations  ( j - q>  q-l>  . . • > l) 

reduces  r to  a multiple  of  e, , and  then  a full  forward  sweep  restores 
~q  ~1 

R to  triangular  form. 


3.3.3  Removal  of  one  superbasic  variable  (Step  7(c)) 

Removal  of  the  q-th  superbasic  variable  implies  deletion  of  the 

corresponding  column  of  R.  The  resulting  upper-Hessenberg  matrix  is 

restored  to  triangular  form  R by  a partial  forward  sweep  of  plane 

rotations  n (j  = q+i,  ...  , s): 

0 


*R  with 

R 

' 

DELCOL: 

Q • • • 0 0 x 

© V2  Vl 

q-th  column 
deleted 

- 

= 

0 
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3-3-^  Addition  of  one  superbasic  variable  (Step  2(c)) 


When  a vector 


Z = [Z  z] 


a is  added  to  B-, 
~q  2 


where  z = 


the  new  null-space  matrix  is 


r r.“l 
B.,  a 
1 ~q 


0 


Following  Gill  and  Murray  ([25  ] i PP-  76-77)  we  approximate  the  vector 
Gz  by  finite  differences , thus : 


£(x  + Sz)  - g(x) 

X = 5 = G£  + } 


where  5 is  a small  step  in  the  direction  z,  for  example,  6 = e'^/||z|| . 

The  following  procedure  can  then  be  used  to  generate  a new  column  for  R: 

( 


ADDCOL: 


Solve 
{ Compute 

Take 


RTr  = ZTv 


a = - \kf  > P “ \a 


l/2 


R = 


* £ 
p 


T ~x 

(Note  that  z v is  best  computed  as  the  last  element  of  Z v rather  than 

from  z and  v directly. ) 

Comparison  of 


T 


-T- 

Sk  = 


. r P 

u~ 


R r 

rsj 

P 


■ T 
RE 

T i 

Z v 
/-*/ 

T 

v Z 

T 

z V 

ns/  ns/J 

and 


2k 


rzTi 

r T 

Z GZ 

T 1 

h oz 

G [Z  z]  = 

T 

z GZ 

. rs> 

zTGz 

/V  <v_ 

-T  - 

Z GZ 


T T 

shows  that  if  R R provides  a good  approximation  to  Z GZ 

-T-  -T  - 

then  R R has  some  chance  of  being  a useful  approximation  to  Z GZ. 

The  main  work  involved  here  is  in  computing  the  gradient  vector 

g(x  + 5z)  and  the  projection  ZTv.  This  work  is  essentially  wasted  if 

the  expression  for  o is  not  positive , which  may  happen  for  many  reasons; 
-T  - 

e.g.  if  Z GZ  is  not  positive  definite  at  the  current  point,  if  R is 
a poor  approximation  or  if  R is  very  ill-conditioned.  In  such  cases  we 

t 2./ p 

set  r ~ 0 and  take  p to  be  either  (z  v)  ' or  1.0,  thus: 


(30) 


R = 


R 0 
P 


(30) 


One  advantage,  at  least,  is  that  the  subsequent  search  direction  will 
move  the  new  superbasic  variable  xq  away  from  its  bound,  so  there  is 
no  danger  of  cycling  on  xq. 

With  many  problems  the  condition  o < 0 occurs  only  occasionally 

or  not  at  all.  Computing  r and  p as  shown  then  leads  to  significantly 

fewer  iterations  than  if  (30)  were  used  all  the  time.  On  the  other  hand, 

° > 0 is  not  a sufficient  condition  for  success.  In  particular  if  the 

current  point  is  near  a singularity  in  g(x)  the  difference  approximation 

to  Gz  is  unlikely  to  be  good.  (An  example  is  when  f(x)  has  terms  of 

the  form  x.  log  x.  and  the  constraints  include  bounds  such  as  x.  > 10~ 1 
d J J 

In  such  cases,  r and  p prove  to  be  consistently  very  large,  resulting 


.) 
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in  an  R which  is  much  more  ill-conditioned  than  R.  Subsequent  iterations 
make  little  progress  until  the  associated  quasi-Newton  updates  restore 
the  condition  of  R.  In  contrast,  use  of  (30)  with  p = 1.0  gives  rapid 
progress. 

Let  d and  d . be  the  largest  and  smallest  diagonals  of  R. 
max  mxn 

As  a heuristic  means  of  detecting  the  above  situation  we  monitor  ||v|| 
and  resort  to  (30)  whenever  ||v]|  is  significantly  large  than  d or 

nisix 

-T 

smaller  than  d (As  a side  benefit,  the  expense  of  computing 

and  r is  then  avoided. ) A final  similar  test  is  made  on  p. 

In  contrast  to  all  previous  discussion,  the  ADDCOL  procedure  just 
described  embodies  a discernible  level  of  ad  hoc  strategy.  However  our 
experience  with  it  has  been  good  in  general,  and  the  combined  use  of 
R1PR0D,  DELCOL  and  ADDCOL  is  almost  certainly  better  than  resetting  R = I 
at  every  change  to  the  set  of  active  constraints. 
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3A  Convergence  tests 


Another  area,  in  which  strategy  plays  an  important  practical  role 
is  in  deciding  when  to  stop  optimizing  in  the  current  subspace  and  consider 
moving  away  from  one  of  the  active  constraints.  Here  we  must  enlarge  on 
the  use  of  TOLRG  in  Section  3*1}  recall  that  in  Step  1 of  the  algorithm, 
TOLRG  was  tested  to  determine  if  it  was  time  to  compute  Lagrange  multipliers 
(reduced  costs,  and  add  one  more  superbasic  variable. 

Suppose  that  after  a particular  iteration  we  have 

Ax^  = the  change  in  the  superbasic  variables 

Af  = the  change  in  f 

7[  = the  new  pricing  vector 
T 

h = Z g,  the  new  reduced  gradient 
Cf}  ^IiRG;  eg  = positive  scalars 
e = machine  precision 


and  let  T^  be  a set  of  tests  (with  values  true  or  false)  defined  as 
follows : 

V 1^11  < (ex  + ) (l  ' ||^||) 

T2:  M < Uf  + e)(i  + |f  | ) 

Ty  ||h||  < TOLRG 

v Nl  < egltell 
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In  place  of  the  simple  test 


if  Tj  then  compute  X , 

the  following  combined  test  is  used: 

if  (Tx  and  Tg  and  T^)  or  then  compute  X . 

The  general  form  of  this  test  follows  that  used  in  the  algorithm  lcmna 
of  Gill,  Murray  and  Picken  [273 > in  which  the  scalars  identified  here  by 
e , €-,  TOLRG  and  € are  fixed  at  certain  "loose"  values  initially  and 

XX  2 

are  then  reset  to  "tight"  values  once  it  appears  that  the  optimal  set  of 
active  constraints  has  been  identified.  Use  of  e and  e„  in  this 

X JL 

way  is  justified  in  the  sense  that  it  seems  reasonable  to  remain  on  the 

present  set  of  active  constraints  as  long  as  significant  progress  is 

being  made.  Use  of  e in  T.  allows  for  the  possibility  that  the 

g " 

last  step,  though  significant,  may  have  moved  x very  close  to  an 

optimum  in  the  current  subspace  (e.g.  the  quasi-Newton  procedure  should 

achieve  this  regularly  if  f(x)  is  quadratic). 

In  adopting  the  above  strategy  we  have  found  it  beneficial  to  vary 

TOLRG  dynamically.  In  the  current  version  of  MINOS  this  is  done  as 

follows.  Suppose  that  the  "best"  Lagrange  multiplier  at  some  stage  is 
T 

X<1  5 gq  " ~ V ^ corresPond^nS  variable  x^  becomes  superbasic, 
the  reduced  gradient  for  the  expanded  subspace  will  be 

fh  1 


Now  recall  from  equation  (2l)  that  unless  h is  reasonably  small,  even 
one  further  iteration  could  change  tt  and  hence  X^  significantly. 
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Therefore  as  a safeguard  (which  is  admittedly  heuristic)  we  accept  A. 
and  move  into  the  new  subspace  only  if  ||h||  < 0.9|A  | , which  implies 

os  q 


IlfelL  < 0.9I&L . 


We  then  reset  TOLRG  for  the  new  subspace  to  be 


TOLRG  - Tlg||h||w 


where  t]  € (0,1)  is  a parameter  which  is  available  to  the  user  to  set 
S 

at  his  own  will  (and  peril! ).  A typical  value  is  i)  = 0.2  and  its  function 

E 

is  analogous  to  that  of  the  parameter  eta  in  procedure  delinsearch.  For 


example  a small  value  of  t)  allows  the  user  to  insist  on  an  accurate 

S 

optimization  within  each  subspace. 
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!<•.  Use  of  first  and  second  derivatives 


We  have  assumed  throughout  that  the  gradient  g(x)  is  available 

and  we  have  avoided  explicit  use  of  the  Hessian  matrix  G(x).  Some 

discussion  of  alternatives  is  in  order.  The  principal  factor  here  is 

T 

the  expense  of  transforming  even  one  vector  by  Z or  Z . In  fact  if 
the  constraint  matrix  A has  many  rows,  most  of  the  work  per  iteration 
lies  in  computing  p = Zp^  and  h = Z g.  (These  calculations  are 
analogous  to  the  FTRAN  and  BTRAN  operations  in  linear  programming. ) 

1.  When  g is  not  available  it  would  often  be  practical  to  form  an 
approximation  g using  finite  differences  along  the  coordinate 
directions,  e.g., 

. f(x  + Se.)  - f (x) 

= i • 

(The  number  of  g^'s  to  be  computed  this  way  is  equal  to  the  number  of 

nonlinear  variables.)  Just  one  transformation  with  ZT  is  then 

T~ 

required,  viz.  h ~ Z g. 


2.  An  alternative  which  is  normally  viable  would  be  to  difference  f (x) 

fv 

along  the  directions  z . : 

~J 


A.  + SzJ  - f(x) 

V ^ ~&ahJ 


T 


where  z^  = Ze^ , j = 1,  . . . , s . Unfortunately  this  approach  is  not 
practical  for  large  problems,  since  storage  limitations  prevent  saving 
all  s vectors  Zy  and  the  work  involved  rules  out  recomputing 
them  when  required. 
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3.  If  g(x)  and  perhaps  G(x)  are  available,  the  system  of  equations 


T T 

Z GZpg  = ••  Z\ 


(31) 


eould  sometimes  be  treated  by  a modified  Newton  method  (Gill  and 

Murray  [233;  Gill,  Murray  and  Picken  [27]).  This  involves  either 
T 

computing  Z GZ  directly: 

m m 

Z GZ  = [z.Gz .] 

or  differencing  g(x)  thus: 


v. 

~a 


g(x  + 5z,. ) - g(x) 

s 


> 


ZTGZ  « i (ZTV  + VTZ)  . 


However  the  need  for  the  vectors 
culties  for  large  problems. 


z . 


~0 


again  presents  severe  diffi- 


4.  If  G is  large  and  sparse,  equation  (31)  could  sometimes  be  solved 

iteratively  by  the  method  of  conjugate  gradients  (e.g.  see  Gill  and 

Murray  [25;  P-  133 3 ) • Storage  is  minimal  since  the  method  avoids 

T 

forming  the  matrix  Z GZ  or  any  approximation  to  it.  However  if  Z 
s columns  the  method  usually  requires  o(s)  products  of  the  form 
ZT(g(Zv)). 

5.  A final  (more  promising)  alternative  is  to  abandon  equation  (31) 
and  to  generate  a search  direction  by  a nonlinear  conjugate-gradient 
type  method  such  as  that  of  Fletcher  and  Reeves  [193  (e.g.  see  Gill 
and  Murray  ([25  3,  p.  13^))'  This  takes  the  form 


has 
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(a)  h = -z\  • 

(b)  if  restart  then  p^  = -h 

£Lii  Eg  = “&  + PSg 

(c)  £ = 2^2 

where  p^,  p^  are  the  previous  and  current  search  directions  for  the 
superhasics.  Several  methods  have  been  suggested  for  determining  the 
scalar  {3,  e.g. 

Fletcher  and  Reeves  [19]:  f3  = ||h||2/||h||2 

Polak.  and  Ribiere  [46]:  (3  = h^(h  - h)/]|h||2 

Perry  [45].  p = hT (h  - h - a^)/j^(h  - h) 

In  MINOS,  a switch  is  made  to  one  of  these  methods  if  the  number  of 
superbasics  s becomes  larger  than  the  dimension  specified  for  the 
matrix  R.  A restart  occurs  whenever  the  set  of  active  constraints 
changes}  also  every  s+1  iterations  in  the  (rare)  event  that  more  than 
s consecutive  steps  are  unconstrained.  More  refined  restart  procedures 
(e.g.  Powell  [47])  will  require  future  investigation.  In  the  present 
environment  the  above  formulas  for  P have  all  performed  rather  similarly 
(though  seldomly  as  well  as  quasi-Newton).  An  example  is  given  in  §5.2.4. 
To  summarize:  the  reduced-gradient  approach  allows  maximum  efficiency 
in  dealing  with  large  sparse  linear  constraints,  but  at  the  same  time  it 
alters  our  perspective  on  the  relative  merits  of  Newton,  quasi-Newton  and 
conjugate  gradient  methods  for  handling  the  nonlinear  objective.  Strangely 
enough,  even  if  the  exact  Hessian  matrix  were  available  (no  matter  how  sparse) 
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we  could  not  afford  to  use  it.  In  this  context  we  find  that  quasi-Newton 
methods  take  on  a new  and  unexpected  importance.  The  storage  required  for 
the  Hessian  approximation  is  often  moderate  even  when  there  are  many  linear1 
or  nonlinear  variables,  as  long  as  the  total  number  of  superbasic  variables 
is  of  order  100  (say)  or  less.  Otherwise,  a conjugate-gradient  method 
remains  the  only  viable  alternative. 


4.1.  Quadratic  programs 

The  above  statements  do  not  hold  if  G happens  to  be  a constant 
matrix.  In  this  case  the  relation 

RTR  = ZTGZ  (32) 

T 

can  often  be  maintained  exactly  without  recomputing  Z GZ  every  iteration. 

Such  a specialization  has  been  described  by  Gill  and  Murray  [26],  along 

T 

Ath  the  measures  required  to  allow  for  Z GZ  being  indefinite.  The 
present  quasi-Newton  algorithm  could  be  specialized  as  follows: 

1.  Initialize  R at  the  start  of  a run  to  satisfy  (32) . (This  is  trivial 
if  there  are  no  superbasics;  it  may  not  be  possible  for  an  arbitrary  set 
of  superbasics  since  Z GZ  could  be  indefinite. ) 

2.  In  procedure  ADDCOL  (§3.3.4)  compute  the  vector  v = Gg  directly 
rather  than  by  differencing  the  gradient. 

3.  Suppress  the  variable-metric  updates  to  R (COMDFP  and  RANK1  in  §3.3.1) 
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However  it  is  worth  noting  that  the  difference  approximation  to  v = Gz 
will  he  essentially  exact,  so  that  if  (32)  ever  holds  at  any  stage  then 
ADDCOL  will  maintain  (32)  almost  exactly  when  a column  is  added  to  Z. 

A step  a = 1,0  along  the  next  search  direction  will  then  move  x i?o  the 
new  subspace  minimum.  Now  it  is  easily  verified  that  the  subsequent 
variable  metric  updates  will  cause  no  net  change  to  R (ignoring  slight 
rounding  error  in  the  case  of  COMDFP).  The  scene  is  therefore  set  for 
another  exact  minimization  during  the  next  iteration. 

The  above  sequence  will  be  broken  if  a constraint  forces  some 
step  a to  be  less  than  1.0.  The  variable -metric  updateq  will  then  alter 
R slightly,  (32)  will  cease  to  hold  and  the  next  subspace  minimization 
may  require  more  than  one  iteration.  In  certain  applications  this  could 
be  undesirable,  but  more  generally  the  robustness  and  self -correcting 
properties  of  quasi-Newton  methods  offer  compensating  advantages,  including 
the  ability  to  start  with  any  matrix  R (such  as  I).  Suffice  to  say 
that  the  general  algorithm  comes  close  to  being  "ideal"  on  quadratic  programs, 
without  undue  inefficiency  or  any  specialized  code. 


r/ . Computational  Experience 


Although  the  prime  application  of  this  research  is  to  large-scale 
linear  programs  with  a nonlinear  objective  function,  we  have  endeavored 
to  attack  a comprehensive  range  of  problems  to  aid  development  of  the 
algorithm.  It  is  unfortunate  that  large-scale  nonlinear  problems  are  not 
widely  reported  in  the  literature,  so  that  many  of  the  results  discussed 
here  refer  to  problems  which  are  solely  within  the  authors*  own  purview. 

A brief  description  of  each  problem  is  given.  Fuller  details  of 
constraint  data,  starting  points,  etc.  must  be  left  to  a future  report. 

Three  of  the  starting  options  provided  in  MINOS  are  as  follows: 

1.  (CRASH)  A triangular  basis  matrix  is  extracted  from  the  matrix  A, 
without  regard  to  feasibility  or  optimality.  The  number  of  superbasic 
variables  is  set  to  zero. 

2.  (initialization  of  nonlinears)  The  user  specifies  values  for  any  number 
of  the  nonlinear  variables.  These  are  made  superbasic.  CRASH  is  then 
applied  to  the  linear  variables  in  A. 

5.  (Restart)  A previously-saved  bit-map  is  loaded  (specifying  the  state 
of  all  variables),  along  with  values  for  any  superbasic  variables. 

This  allows  continuation  of  a previous  run,  or  an  advanced  start  on  a 
different  but  related  problem  (for  example  the  bounds  £ < x < u 
may  be  changed). 

Options  2 and  5 normally  reduce  run  time  considerably,  but  the 
results  reported  here  were  obtained  using  the  "cold  start"  option  1 unless 
otherwise  stated.  A normal  phase  1 simplex  procedure  was  used  to  obtain 
an  initial  feasible  solution. 
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5.1. 


Description  of  Test  Problems 


1.  Colville  No.  1.  This  is  problem  no.  1 in  the  Colville  series  of  test 
problems  [ 9 ].  ^Tie  objective  is  a cubic  function  of  5 variables. 

2.  Colville  No.  7.  This  is  a quartic  function  of  16  variables. 


3«  Chemical  Equilibrium  Problem.  This  particular  example  of  the  chemical 
equilibrium  problem  was  obtained  from  Himmelblau  [54] , problem  6.  The 
objective  is  of  the  form 


f(S>  x3k(V*fn<V?xi*,)1 

K j 1 


(Note.  Slight  corrections  were  made  to  the  constraint  data  in  [54,  p.  401], 
The  group  of  coefficients  [-1, -2, -5, -4)  in  column  15  was  moved  to 
column  l4,  and  a similar  group  in  column  12  was  moved  to  column  15.) 

4.  Weapon  Assignment  Problem.  This  problem  appeared  originally  in  Bracken 
and  McCormick' s book  on  nonlinear  programming  applications  [ 5 L and 
more  recently  in  Himmelblau  [54],  problem  25«  The  objective  function  is 

20  5 x. . 

f(x)  = £ u.(  n a.^  - l) 

~ j-l  3 i=l  ^ 


with  unknowns 
be  integers. 


x.  . > 0. 
10  “ 


We  have  ignored  the  requirement  that  the  x.  . 


5.  Structures  Optimization  (Q. P. ).  This  is  a series  of  quadratic 
programming  problems  in  structures  design  [58]. 


6.  Oil  Refinery  Investment  Model.  This  is  typical  of  many  linear  programming 
based  oil  refinery  models,  but  has  the  added  feature  that  nonlinear 
returns  to  scale  of  capital  equipment  costs  are  defined  explicitly. 

The  particular  problem  cited  in  the  results  has  15  nonlinear  variables 


of  this  kind. 
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7.  Energy  Submodel.  A related  research  project  on  the  development  a 

national  energy  model  [433  has  given  rise  to  a fairly  complex  submodel 
of  the  electricity  sector.  The  2k  nonlinear  variables  are  mainly  the 
capacities  of  the  different  types  of  generating  equipment. 


8.  Expanded  Energy  System  Model.  An  expanded  model  which  covers  all  aspects 
of  energy  production  and  distribution  on  a national  level  has  been 
developed  [533-  This  is  a medium-scale  linear  program  with  9l  nonlinear 
variables  in  the  objective;  again  these  are  mainly  nonlinear  returns 
to  scale  of  capital  equipment  costs  of  the  form 


91 

Z 

i=l 


c.x. 

l l 


with  0 < p.  <1  (around  0.6  to  0.7). 


9.  Energy  Model  RS8.  This  is  a lb-period  energy  model  which  was  formulated 
from  the  outset  as  a nonlinear  programming  problem  (see  Manne  [38],  [393 )• 
The  objective  is  of  the  form 

16  a. 

Z — + linear  terms 


with  one  pair  of  nonlinear  variables  x^,  y^  for  each  time  period 
(those  for  the  first  two  periods  being  known).  This  was  the  first  large 
problem  available  to  us  and  is  of  interest  for  several  reasons.  In 
particular  it  provides  a comparison  with  a (considerably  larger)  linear 
approximation  to  the  problem,  in  which  each  term  a^/x^:  was  discretized 
over  a two-dimensional  grid.  Further  details  are  given  in  §5*2.2. 


10.  Energy  Model  ETA  (Manne  [ 40  3 ) - This  is  a further  development  of  the 

previous  model.  The  objective  is  the  same  as  in  RS8  with  the  addition 

of  E}6,  z?  for  16  variables  z.. 
i=l  i l 


5.2.  Results 

The  results  summarized  in  Table  1 were  obtained  on  a Burroughs 

—11 

B6700  computer  using  single-precision  arithmetic  (e  ^10  ).  The  standard 

time  ratios  quoted  are  relative  to  tne  processor  time  required  for  a standard 
timing  program  given  in  Colville  [ 9 ] . The  standard  time  for  unoptimized 
B6700  Fortran  is  83. 07  seconds. 

The  results  in  Table  2 onwards  were  obtained  using  double  precision 
arithmetic  on  an  IBM  370 /l68  (e  « lO’"1'5).  The  standard  time  for  this 
machine  with  the  IBM  Fortran  IV  (H  extended^  compiler  with  full  optimization 
is  3.92  seconds.  A fairly  accurate  line-search  was  normally  used  (eta  = 0.01) 
and  the  quantity  ||h||/||m||  was  reduced  to  10 or  less  at  optimality. 
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5.2.1.  The  chemical  equilibrium  problem  (problem  3) 


This  example  provided  useful  experience  in  dealing  with  logarithmic 
singularities  in  g(x) . The  objective  consists  of  functions  of  the  form 

f = s x.8l 

whose  gradient  components  are 


x. 

g.  = c.  + In  — - — . 

L x . 

3 3 

If  some  x^  is  zero,  the  corresponding  term  in  f may  be  correctly 

programmed  as  (x.jg^)  = 0*  However,  g^  itself  is  then  analytically  minus 

infinity  (unless  all  x.  =0),  and  any  particular  numerical  value  given 

0 

to  it  in  the  gradient  subroutine  will  result  in  a discontinuity  in 

as  x^  moves  (even  slightly)  away  from  zero.  To  avoid  this  difficulty 

we  ran  the  problem  with  a uniform  lower  bound  = 10  on  all 

variables,  for  various  values  of  k in  the  range  4 to  10.  (The  problem 

~3 

is  infeasible  with  x.  >10  .)  Results  are  summarized  in  Table  3, 

0 

where  each  run  continued  from  the  run  before  using  starting  option  3. 

The  minimal  change  in  f(x)  is  typical  of  dual  geometric  programs, 
but  values  x.  = 10  and  x.  = 10  (say)  have  very  different 

0 U 

physical  interpretations  and  therefore  warrant  more  than  the  usual 
degree  of  resolution. 

In  Table  4 we  list  the  largest  solution  value  x^  and  the 
8 smallest  values  in  the  order  by  which  they  became  superbasic.  The 
most  significant  variation  is  in  x^,..  Most  values  have  stabilized 
by  the  time  k reaches  10. 

4l 


A lower  bound  on  the  condxtion  number  of  the  projecte* 
of  the  ratio  of  the  largest  and  smallest  diagonals  of 
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For  interest,  the  last  row  of  Table  4 shows  the  values  obtained 

by  the  program  SUMT  as  reported  by  Himmelblau  l 34] . There  appear  to  be 

no  accurate  significant  digits  in  the  8 smallest  x..  (This  may  be  due 

0 

to  differences  in  the  constraint  data,  errors  in  satisfying  the  general 

constraints,  or  simply  different  machine  precisions.) 

Note  that  when  x . is  small  the  diagonal  elements  of  the 
3 

Hessian  matrix  are  dg./Sx.  = 0(l/x.).  However  these  large  elements 

J 3 3 

affect  the  reduced  Hessian  only  when  x.  is  basic  or  superbasic. 

3 

The  safest  strategy  for  these  problems  therefore  appears  to  be  the 
following: 

/ v -4 

(a)  Solve  the  problem  with  relatively  large  lower  bounds,  e.g.  x.  > 10 

3 

A near-optimal  objective  value  will  be  obtained  quickly  because 
the  reduced  Hessian  remains  reasonably  well-conditioned. 

(b)  Reduce  the  lower  bounds,  perhaps  in  stages,  to  0(e ^2)  or  0(e2/^) 
There  will  be  essentially  no  further  basis  changes,  and  in  roughly 
descending  order  the  small  will  leave  their  bounds  one  by  one 
to  become  superbasic. 

-4  -10 

Solution  of  problem  3 with  x.  > 10  followed  by  x.  > 10  required 

3 3 

a total  of  103  iterations  and  452  function/gradient  evaluations  as 
shown  in  Table  2.  Solution  with  x.  > 10-10  directly  required  188 

t) 

ite~ations  and  886  evaluations,  primarily  because  the  Hessian  approxima- 
tion became  very  ill-conditioned  before  a near-optimal  point  was  reached. 

As  a natural  precaution  against  rounding  error  the  linesearch 
procedure  delinsearch  avoids  evaluating  f(x  + Op)  with  values  of 
a that  are  very  close  together.  On  the  IBM  370/168  this  prevented 
resolution  below  lO**'1'0,  although  for  this  special  case  f (x^  could 
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easily  be  evaluated  using  higher  precision  arithmetic.  The  limiting 
factor  would  then  become  the  condition  of  the  reduced  Hessian. 

5.2.2.  Energy  model  RS8 

Problem  9&  in  Table  2 refers  to  the  original  linearized  version 
of  the  energy  model,  in  which  each  term  of  the  form 

f(x,y)  = 

xy 

was  approximated  over  a 6 x 6 grid.  It  has  twice  as  many  columns  and 
matrix  coefficients  as  the  nonlinear  version  9b.  Note  that  construction 
of  the  small  but  reasonably  fine  grid  required  good  prior  estimates 
of  the  optimal  values  for  the  l4  (x,y)  pairs. 

Run  9b  is  included  to  illustrate  the  rather  poor  performance 
that  could  be  encountered  during  early  " de-bugging"  of  a nonlinear  problem. 
Some  relevant  facts  follow. 

(a)  The  bounds  on  nonlinear  variables  were  conservative  in  the  sense 
that  the  lower  bounds  were  far  removed  from  the  optimal  solution 
values  and  there  were  no  upper  bounds. 

(b)  No  attempt  was  made  to  initialize  the  nonlinears  at  reasonable 
values  between  their  bounds. 

(c)  The  y variables  proved  to  be  badly  scaled. 

To  enlarge  on  the  last  point,  the  Hessian  matrix  of  f(x,y)  above  is 
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G(x,y) 
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and  it  follows  from  the  diagonal  elements  of  the  triangular  factor  that 
G has  a condition  number  k( G)  > y /2x  . Now  the  optimal  values  for 
the  x and  y variables  are  all  0(l)  and  0(100)  respectively, 
which  might  normally  be  considered  well-scaled;  however  it  means  that 
k(G)  is  at  least  O(lO^),  which  in  this  case  is  unnecessarily  large. 
Replacing  each  y by  a variable  y = y/lOO  gave  a significant  improve- 
ment as  shown  by  run  9c  in  Table  2. 


5.2.3.  Energy  model  ETA 

It  is  in  runs  lOa-lOc  that  the  real  benefits  from  a nonlinear 
optimizer  become  apparent.  This  is  an  example  of  the  model-builder's 
standard  mode  of  operation  wherein  numerous  runs  are  made  on  a sequence 
of  closely  related  problems  with  the  solution  from  one  run  providing 
a starting  point  for  the  next.  Here,  problem  10a  (the  base  case)  was 
solved  from  a cold  start  with  certain  variables  fixed  at  zero;  for 
run  10b  the  bounds  were  relaxed  on  1 6 of  these  variables,  and  for  run 
10c  a further  10  variables  were  freed,  (in  this  particular  sequence 
the  starting  solutions  for  10b  and  10c  were  clearly  feasible.  This  is 
desirable  but  not  essential. ) 
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Compared,  to  solving  linearized  approximations  by  standard  linear 
programming,  some  of  the  obvious  advantages  are: 

1.  reduced  problem  size; 

2.  reduced  volume  of  output  (in  the  absence  of  a report  writer) j 

3.  ability  to  prepare  data  for  several  runs  in  advance,  since  there 
are  no  grid  variables  to  be  moved  or  refined; 

4.  the  solution  obtained  actually  solves  the  correct  problem. 
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5-2.4.  Comparison  of  Quasi-Newton  and  Conjugate  Gradients 

The  weapon  assignment  problem  (no.  4)  was  chosen  here  as  a 

reasonably  small  but  nontrivial  example.  About  6o  changes  in  the 

active  constraint  set  occur  during  the  iterations. 

The  parameters  being  varied  are 

t)  = linesearch  accuracy  tolerance  (eta  in  §3); 

Tj  = the  tolerance  for  minimization  within  each  subspace  (see  §5.4). 
s 

Recall  that  small  values  of  these  parameters  mean  accurate  minimization. 

For  Table  5 we  set  t]  =0.5  and  compared  the  normal  Quasi-Newton 

6 

algorithm  with  each  of  the  conjugate  gradient  algorithms  for  various 
values  of  rj.  We  find  that  Quasi-Newton  is  consistently  superior  and 
is  quite  robust  with  respect  to  diminishing  linesearch  accuracy,  in 
contrast  to  the  conjugate  gradient  (eg)  algorithms.  Unfortunately 
there  is  no  discernible  trend  that  singles  out  one  eg  algorithm  over 
another. 


| 
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For  Table  6 the  same  runs  were  made  with  t|  =0.01.  (A  more 

s 

accurate  subspace  minimization  makes  the  sequence  of  constraint 
changes  more  consistent  between  runs.)  This  smoothed  out  the  iteration 
and  function-evaluation  counts,  but  again  there  is  no  evidence  to  favor 
any  particular  eg  algorithm. 

To  illustrate  that  the  eg  methods  are  not  to  be  discarded 

immediately,  in  Figure  1 we  have  plotted  the  value  of  f(x)  against 

iteration  number  for  the  second  row  and  first  two  columns  of  both 

Tables  5 and  6.  Thus  a reasonably  accurate  linesearch  was  used  for 

all  cases  (r)  = 0.01).  Curves  1 and  2 compare  Quasi-Newton  with 

Fletcher-Reeves  using  q =0.5,  and  curves  3 and  4 do  the  same  with 

6 

= 0.01. 

The  first  two  curves  show  smooth  progress  for  both  methods. 

Note  that  although  the  eg  method  lags  behind  it  has  essentially 
identified  the  final  set  of  active  constraints  by  the  time  the  Quasi- 
Newton  method  converges  (iteration  139).  The  step-function  shape  of 
curves  3 and  4 illustrates  the  work  that  is  wasted  in  converging  to 
minima  within  each  subspace.  Otherwise  these  curves  effectively  place 
a magnifying  glass  on  the  tail  end  of  the  other  runs.  The  terminal 
convergence  of  the  eg  method  is  clearly  very  slow  and  it  is  here  that 
better  restart  procedures  such  as  in  Powell  [47  ] should  prove  to  be 
most  valuable. 
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n 

Quasi 

-Newton 

Fletcher-Reeves 

Polak-Ribiere 

Perry 

0.001 

123 

375 

226 

840 

' 222 

806 

198 

713 

0.01 

139 

255  i 

223 

728 

237 

770 

259 

849 

0.1 

122 

281  { 

227 

671 

238 

709 

228 

665 

0.2 

137 

300  [ 

250 

721 

252 

749 

218 

578 

0.3 

148 

291 

239 

648 

248 

688 

307 

8l4 

0.4 

156 

289 

282 

742 

296 

853 

309 

762 

0.5 

153 

242 

275 

695 

394 

1079 

612 

l4n 

0.9 

207 

256 

694 

987 

>999 

>2748 

818 

968 

Table  5.  Iterations  and  function  + gradient  evaluations 

for  the  weapon  assignment  problem;  rj  =0.5; 

8 

various  linesearch  tolerances  rj. 


*1 

Quasi 

-Newton 

Fletcher-Reeves 

Polak-Ribiere 

Perry 

0.001 

220 

615 

! 493 

1628 

44o 

1514 

440 

1495 

0.01 

219 

548 

498 

1520 

471 

1520 

466 

1476 

0.1 

209 

46i 

560 

1597 

508 

l46l 

530 

1568 

0.2 

218 

445 

! 582 

1589 

531 

1517 

585 

1626 

0.3 

229 

4ll 

t 612 

1557 

634 

1752 

613. 

1625 

0.4 

262 

44l 

« 748 

1831 

691 

1821 

752 

1788 

0.5 

262 

377 

; 691 

1633 

818 

1993 

894 

1974 

0.9 

288 

345 

’>999 

>1855 

>999 

>1658 

>999 

>1156 

Table  6.  Iterations  and  function  + gradient  evaluations 
for  the  weapon  assignment  problem;  q = 0.01 
(more  accurate  minimization  within  each  subspace) . 
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Quasi- 


6.  Comparison  with  other  algorithms 


i 


L ! 


Many  of  the  ideas  discussed  here  were  either  implicit  in  or 

anticipated  by  the  work  of  Wolfe  [ 56] , [ 57] , Faure  and  Huard  [ l6 ] and 

McCormick  [41],  [42].  However  there  have  since  been  such  significant 

advances  in  implementation  techniques  for  the  numerical  methods  involved 

that  there  is  little  point  in  making  detailed  comparisons.  Algorithmically, 

one  important  difference  is  our  emphasis  on  keeping  the  number  of  super- 

basic  variables  as  small  as  possible  and  changing  that  number  by  at 

* 

most  1 each  iteration.  With  the  quasi-Newton  approach,  this  strategy 
retains  maximum  information  about  the  projected  Hessian.  Even  though 
the  proof  of  convergence  [4l]  for  the  variable -reduct ion  method  depended 
on  regular  resetting  of  the  Hessian  approximation,  we  never  set  R = I 
except  at  the  start  of  a run  or  in  the  rare  event  that  the  linesearch 
fails  to  find  an  improved  point  (in  which  case  both  R and  the  true 
Hessian  are  normally  very  ill-condition) . Zig-zagging  is  controlled 
effectively  by  the  tolerance  q and  the  logic  described  in  § 5.4. 

Rates  of  convergence  within  each  subspace  follow  from  analogous  proofs 
for  unconstrained  algorithms. 

Since  the  present  algorithm  possesses  superlinear  convergence 
properties  and  can  handle  rather  arbitrary  sized  problems,  it  should 
be  competitive  with  other  algorithms  designed  specifically  for  quadratic 

In  the  original  reduced-gradient  algorithm  the  set  of  superbasics  was 
effectively  redefined  each  iteration  as  being  the  current  set  plus 
those  nonbasic  variables  whose  reduced  costs  were  of  the  correct  sign. 
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programming  (e.g.  Wolfe  [55],  Beale  (4],  Cottle  [11],  Lemke  (57]). 

In  particular  a comparison  with  Beale's  meth<  i would  be  relevant, 
since  it  is  reported  that  his  method  is  efficient  for  problems  which 
have  a small  number  of  quadratic  terms.  On  general  (unstructured) 
problems  with  m and  n large  it  is  doubtful  that  Wolfe’s  algorithm 
or  the  linear  complementarity  methods  would  compare  well  because  they 
work  with  linear  systems  of  order  m + n rather  than  ra. 

A final  comment  on  problems  which  have  a large  sparse  set  of 
general  constraints  Ax  > b in  relatively  few  variables  (thus  A is 
m x n with  m > n).  Ideally,  methods  designed  specifically  for  this 
case  use  an  active-constraint  strategy  and  avoid  transforming  the  whole  of 
A each  iteration  (e.g.  the  version  of  the  reduced-gradient  algorithm 
in  Wolfe  [57]),  and  the  implementation  of  Buckley  [8]).  The  improved 
efficiency  of  these  methods  is  analogous  to  the  benefit  that  might  be 
realized  in  the  purely  linear  case  if  the  dual  simplex  method  were  applied 
to  the  dual  linear  program.  Nevertheless,  given  the  use  of  sparse- 
matrix  techniques,  solution  by  the  present  (canonical  form)  method  will 
be  quite  efficient  unless  m » n.  In  any  event,  with  n moderate  by 
assumption,  this  i one  class  of  problems  where  the  number  of  superbasic 
variables  (and  hence  the  dimension  of  the  reduced  hessian)  will  always 
remain  manageably  small. 
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7.  Conclusion 


Our  primary  aim  has  been  to  combine  the  simplex  algorithm  with  ■ 
quasi-Newton  techniques  in  an  efficient  and  reliable  computer  code 
for  solving  large,  linearly  constrained  nonlinear  programs.  The  full 
potential  of  conjugate-gradient  methods  in  this  context  remains  to  be 
explored,  but  the  necessary  framework  now  exists;  this  framework  will 
also  accommodate  extension  to  problems  with  a moderate  number  of  non- 
linear constraints  (e.g.  Jain,  Lasdon  and  Saunders  [35]).  In  the 
meantime  the  code  is  applicable  to  an  important  class  of  potentially 
very  large  problems,  viz,  dual  geometric  programs,  and  more  generally 
it  should  provide  a new  dimension  of  utility  to  an  already  substantial 
body  of  large-scale  linear  programming  models. 
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